Systematic analysis of group identification in stock markets 
We propose improved methods to identify stock groups using the correlation matrix of stock price
changes. By filtering out the marketwide effect and the random noise, we construct the correlation 
matrix of stock groups in which nontrivial high correlations between stocks are found. Using the 
filtered correlation matrix, we successfully identify the multiple stock groups without any extra 
knowledge of the stocks by the optimization of the matrix representation and the percolation ap- 
proach to the correlation-based network of stocks. These methods drastically reduce the ambiguities 
while finding stock groups using the eigenvectors of the correlation matrix. 
I. INTRODUCTION

The study of correlations in stock markets has attracted much interest
from physicists because of its challenging complexity as a complex
system and its possible future applications to real markets. In the
early years, a correlation-based taxonomy of stocks and stock market
indices was studied by the method of the hierarchical tree. Recently,
the minimum spanning tree technique was introduced to study the
structure and dynamics of the stock network, random matrix theory was
applied to distinguish the random and nonrandom properties of the
correlations, and the maximum likelihood clustering method was
developed and applied to identify cluster structures in stock markets
[12]. These studies have also been extended to applications in
portfolio optimization in real markets.

Commonly, the correlation between stocks is expressed by the Pearson
correlation coefficient of log-returns,

$$G_i(t) = \ln S_i(t + \Delta t) - \ln S_i(t), \tag{1}$$

where $S_i(t)$ is the price of stock $i$ at time $t$. From real time
series data of $N$ stock prices, we can calculate the elements of the
$N \times N$ correlation matrix $C$ as follows:

$$C_{ij} = \frac{\langle (G_i(t) - \langle G_i \rangle)(G_j(t) - \langle G_j \rangle) \rangle}{\sqrt{(\langle G_i^2 \rangle - \langle G_i \rangle^2)(\langle G_j^2 \rangle - \langle G_j \rangle^2)}}, \tag{2}$$

where $\langle \cdots \rangle$ indicates time averages over the period
of the time series. By definition, $C_{ii} = 1$ and $C_{ij}$ has a
value in $[-1, 1]$.
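
As an illustration of Eqs. (1) and (2), the following is a minimal NumPy sketch that converts a table of prices into log-returns and then into the correlation matrix. The function names and the synthetic random-walk prices are assumptions made for this example only; they are not the NYSE data analyzed in this paper.

```python
import numpy as np

def log_returns(prices):
    """Eq. (1): G_i(t) = ln S_i(t + dt) - ln S_i(t); prices has shape (T, N)."""
    return np.diff(np.log(prices), axis=0)

def correlation_matrix(returns):
    """Eq. (2): Pearson correlation between columns, using time averages."""
    centered = returns - returns.mean(axis=0)
    cov = centered.T @ centered / returns.shape[0]
    std = np.sqrt(np.diag(cov))
    return cov / np.outer(std, std)

# Synthetic geometric random walks (hypothetical data, 4 stocks, 500 days).
rng = np.random.default_rng(0)
prices = np.exp(np.cumsum(0.01 * rng.standard_normal((500, 4)), axis=0))
C = correlation_matrix(log_returns(prices))
```

The same matrix can be obtained with `np.corrcoef(returns, rowvar=False)`; the explicit version is shown only to mirror the terms of Eq. (2).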

Laloux et al. and Plerou et al. studied the statistical properties of
an empirical correlation matrix between stock price changes defined in
Eq. (2) for real markets. In comparison with the prediction of random
matrix theory, they found that the statistics of the bulk eigenvalues
are in remarkable agreement with the universal properties of the
random correlation matrix. For example, the bulk part of the
eigenvalue spectrum of the empirical correlation matrix for $N$ stocks
over $L$ price data has the form of the spectrum of the random
correlation matrix, which is given by

$$P(\lambda) = \frac{Q}{2\pi} \frac{\sqrt{(\lambda_{\max} - \lambda)(\lambda - \lambda_{\min})}}{\lambda} \tag{3}$$

for $\lambda \in [\lambda_{\min}, \lambda_{\max}]$ in the limit of
$N, L \to \infty$ with fixed $Q = L/N$, where
$\lambda_{\max} = (1 + 1/\sqrt{Q})^2$ and
$\lambda_{\min} = (1 - 1/\sqrt{Q})^2$. Moreover, the level spacing
statistics of the eigenvalues exhibit good agreement with the results
from the Gaussian orthogonal ensemble of random matrices.
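
To make the bounds in Eq. (3) concrete, here is a short sketch using only the Python standard library. The helper names are hypothetical, and the value Q = 5000/135 is an assumption that takes the approximate figures L ≈ 5000 and N = 135 quoted for our dataset at face value.

```python
import math

def mp_bounds(Q):
    """Edges of the random-matrix bulk: (1 -/+ 1/sqrt(Q))^2."""
    lam_min = (1 - 1 / math.sqrt(Q)) ** 2
    lam_max = (1 + 1 / math.sqrt(Q)) ** 2
    return lam_min, lam_max

def mp_density(lam, Q):
    """Eq. (3) inside [lam_min, lam_max], zero outside."""
    lam_min, lam_max = mp_bounds(Q)
    if not lam_min <= lam <= lam_max:
        return 0.0
    return Q / (2 * math.pi) * math.sqrt((lam_max - lam) * (lam - lam_min)) / lam

Q = 5000 / 135          # assumed from L ~ 5000 trading days and N = 135 stocks
lam_min, lam_max = mp_bounds(Q)
```

Under this assumption the bulk of a purely random correlation matrix would lie roughly between 0.70 and 1.36, so eigenvalues well above that range are candidates for the marketwide and group parts.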

On the other hand, the nonrandom properties of the correlation matrix
have also been studied with the empirical correlation matrix [10].
From the empirical data for the New York Stock Exchange, it was found
that each eigenvector corresponding to the few largest eigenvalues,
those larger than the upper bound of the bulk eigenvalue spectrum, is
localized, in the sense that only a few components contribute most of
the eigenvector, and the stocks corresponding to those dominant
components are found to belong to a common industry sector. Very
recently, Utsugi et al. confirmed and improved those results through a
similar analysis for the Tokyo Stock Exchange.

In order to confirm the localization property of eigenvectors, we
perform an analysis similar to the previous studies on the
eigenvectors of the correlation matrix using our own dataset of stock
prices. We analyze the daily prices of $N = 135$ stocks belonging to
the New York Stock Exchange (NYSE) for the 20-year period 1983-2003
($L \approx 5000$ trading days), which are publicly available from the
web site (http://finance.yahoo.com) [14]. Indeed, if we put stocks in
the order of their industrial sectors, we observe that the eigenvector
components corresponding to stocks which belong to specific industrial
sectors give high contributions to each of the eigenvectors for the
few largest eigenvalues (see Fig. 1). For instance, the stocks
belonging to the energy, technology, transportation, and utilities
sectors contribute strongly to the eigenvector for the second largest
eigenvalue; the energy sector constitutes a large part of the
eigenvector for the third largest eigenvalue; the fourth largest
eigenvalue gives an eigenvector localized on the basic materials,
consumer (noncyclical), healthcare, and utilities sectors; and the
eigenvector for the fifth largest eigenvalue is also localized on
several specific industrial sectors.

FIG. 1: The normalized eigenvector components $u_i(\lambda)$ of stock
$i$ corresponding to the second to fifth largest eigenvalues
$\lambda_1$ - $\lambda_4$ of the correlation matrix. The stocks are
sorted by industrial sectors, A: basic materials, B: capital goods,
C: conglomerates, D: consumer (cyclical), E: consumer (noncyclical),
F: energy, G: financial, H: healthcare, I: services, J: technology,
K: transportation, and L: utilities, which are separated by dashed
lines.

However, it is not straightforward to go in the inverse direction and
find specific stock groups, such as the industrial sectors, from the
eigenvectors. If each of the eigenvectors had well-defined dominant
components and the corresponding set of stocks were independent of the
sets from other eigenvectors, it would be easy to identify the stock
groups. Unfortunately, in our study, it turns out that not only can
the set of eigenvector components with dominant contribution hardly be
defined in the eigenvector, but such a set is also likely to overlap
with the sets from other eigenvectors unless we pick a very small
number of stocks with the few highest ranks of their contributions to
the eigenvectors; Figure 1 indicates that each of the eigenvectors is
localized on multiple industrial sectors and the corresponding stocks
severely overlap with those from the other eigenvectors. Therefore it
is very ambiguous to identify the stock groups for practical purposes.
The aim of this study is to get rid of these ambiguities and finally
find relevant stock groups without any aid of the table of industrial
sectors.

In this paper, we introduce an improved method to identify stock
groups which drastically reduces the ambiguities in finding multiple
groups using eigenvectors of the correlation matrix. We first filter
out the random noise and the marketwide effect from the correlation
matrix. With the filtered correlation matrix, we apply optimization
and percolation approaches to find the stock groups. Through the
optimization of the stock sequences representing the matrix indices,
the filtered correlation matrix is transformed into a block diagonal
matrix in which all stocks in a block are found to belong to the same
group. By constructing a network of stocks using the percolation
approach on the filtered correlation matrix, we also successfully
identify the stock groups, which appear in the form of isolated
clusters in the resulting network.

This paper is organized as follows. In Sec. II, the detailed filtering
method to construct the group correlation matrix is given. For the
filtering, the largest eigenvalue and the corresponding eigenvector
are required, and they are calculated from first-order perturbation
theory. In Sec. III, detailed stock group finding methods using the
optimization and the percolation are given and the resulting stock
groups are specified. In Sec. IV, a summary and conclusions are
presented.



II. GROUP CORRELATION MATRIX



A. Filtering 

The group of stocks is defined as a set of stocks whose price changes
are highly intercorrelated. In the empirical correlation matrix,
because several types of noise are expected to coexist with the
intragroup correlations, it is essential to filter out such noise to
isolate the intragroup correlations in which we are interested. With
the complete set of eigenvalues and eigenvectors, the correlation
matrix in Eq. (2) can be expanded as

$$C = \sum_{\alpha=0}^{N-1} \lambda_\alpha |\alpha\rangle\langle\alpha|, \tag{4}$$

where $\lambda_\alpha$ is the eigenvalue sorted in descending order and
$|\alpha\rangle$ is the corresponding eigenvector. Because only the
eigenvectors corresponding to the few largest eigenvalues are believed
to contain the information on significant stock groups, we can
identify a filtered correlation matrix for stock groups by choosing a
partial sum of $\lambda_\alpha |\alpha\rangle\langle\alpha|$ relevant
to stock groups, which we will call the group correlation matrix,
$C^g$.

FIG. 2: (a) The eigenvalues $\lambda > 1.0$ of the correlation matrix
$C$ and (b) the distribution of bulk eigenvalues $P(\lambda)$ (solid
line). The dash-dotted line marks our boundary between the random
noise part and the group correlation part. (c) The matrix element
distribution for the group correlation matrix $C^g$ and the residual
parts corresponding to the bulk eigenvalues, $C^r$, and the largest
eigenvalue, $C^m$.

FIG. 3: The $Q = L/N$ dependence of the matrix element distribution
for $C^g$ (thick solid line), $C^r$ (solid line), and $C^m$ (dashed
line). With fixed $N = 135$, various time periods are tested for
(a) $L \approx 2600$ (1993-2003), (b) $L \approx 950$ (2000-2003), and
(c) $L \approx 240$ (2003).

In order to extract $C^g$ from the correlation matrix, taking the
previous results of Plerou et al. for granted, we posit that the
eigenvalue spectrum of the correlation matrix is organized into the
marketwide part of the largest eigenvalue, the group part of
intermediate discrete eigenvalues, and the random part of small bulk
eigenvalues. Then, we can separate the correlation matrix into three
parts as

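$$C = C^m + C^g + C^r = \lambda_0 |0\rangle\langle 0| + \sum_{\alpha=1}^{N_g} \lambda_\alpha |\alpha\rangle\langle\alpha| + \sum_{\alpha=N_g+1}^{N-1} \lambda_\alpha |\alpha\rangle\langle\alpha|, \tag{5}$$

where $C^m$, $C^g$, and $C^r$ indicate the marketwide effect, the
group correlation matrix, and the random noise terms, respectively.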
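
The three-way split of Eq. (5) can be sketched with a symmetric eigendecomposition. The function name `decompose`, the synthetic factor model used to build a test matrix, and the choice `n_group=2` are all assumptions made for this illustration; in the analysis here, $N_g$ is read off the empirical spectrum instead.

```python
import numpy as np

def decompose(C, n_group):
    """Split C into market, group, and random parts following Eq. (5)."""
    vals, vecs = np.linalg.eigh(C)          # eigh returns ascending eigenvalues
    order = np.argsort(vals)[::-1]          # re-sort in descending order
    vals, vecs = vals[order], vecs[:, order]

    def partial_sum(lo, hi):
        # sum over alpha in [lo, hi) of lambda_alpha |alpha><alpha|
        return (vecs[:, lo:hi] * vals[lo:hi]) @ vecs[:, lo:hi].T

    C_m = partial_sum(0, 1)                 # largest eigenvalue only
    C_g = partial_sum(1, n_group + 1)       # next n_group eigenvalues
    C_r = partial_sum(n_group + 1, len(vals))
    return C_m, C_g, C_r

# Synthetic returns: one market factor plus three group factors (hypothetical).
rng = np.random.default_rng(0)
labels = np.repeat([0, 1, 2], 4)            # 12 stocks in 3 groups
market = rng.standard_normal((400, 1))
group = rng.standard_normal((400, 3))[:, labels]
returns = 0.5 * market + 0.7 * group + rng.standard_normal((400, 12))
C = np.corrcoef(returns, rowvar=False)
C_m, C_g, C_r = decompose(C, n_group=2)
```

By construction the three parts add back up to `C`, so the split discards no information; it only sorts the spectral components into the three classes.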

While the determination of $C^m$ is straightforward, it is not so
clear how to determine $N_g$ for separating $C^g$ and $C^r$. If there
were no correlation between stock prices, the bulk eigenvalues would
have to follow Eq. (3), and thus the upper bound of the bulk
eigenvalues could be clearly determined from $Q$. However, in the
empirical correlation matrix, the bulk eigenvalue spectrum deviates
from Eq. (3) due to the coupling with underlying structured
correlations, such as the group correlation embedded in $C^g$.
Therefore we use a graphical estimation to determine $N_g$; in the
eigenvalue spectrum shown in Fig. 2(b), we choose the cut $N_g = 9$ in
the vicinity of the blurred tail of the bulk part of the spectrum. In
spite of this rough estimation of $N_g$, we note that our results in
this work are not altered by a small change of $N_g$ of about $\pm 1$.
This can be justified by the following arguments. In the group
correlation matrix, the component corresponding to the eigenvalues
close to the bulk part of the spectrum is confined to only a very
small portion of the whole matrix; because the elements of the
correlation matrix component
$\lambda_\alpha |\alpha\rangle\langle\alpha|$ cannot exceed the
eigenvalue $\lambda_\alpha$ in magnitude, large discrete eigenvalues
dominantly contribute to the group correlation matrix. In addition,
even if we count one less eigenvalue near the boundary of the bulk
part of the spectrum in constructing the group correlation matrix, a
possible loss of group information is not likely to be serious because
the pure eigenvectors of the groups generally turn out to be mixed all
together in the eigenvectors of the correlation matrix (see Fig. 1).
Therefore the influence of the error in the determination of $N_g$ is
insignificant, so that it does not change the clustering result.

This decomposition of the correlation matrix gives nontrivial
characteristics to the distribution of the group correlation matrix
elements. In Fig. 2(c), it turns out that the distribution of
$C^g_{ij}$ shows a positive heavy tail. This indicates that $C^g$
contains a non-negligible number of strongly correlated stock pairs,
which are expected to come from the correlation between the stocks
belonging to the same group. On the other hand, $C^r$ shows a Gaussian
distribution consistent with the prediction of random matrix theory
[9]. While this Gaussian-like distribution is also partially observed
in the distribution of $C^g_{ij}$ due to the coupling between group
correlations and random noise, it turns out that this remaining noise
does not seriously affect the identification of stock groups. The
distribution of $C^m_{ij}$ shows that $C^m$ also contains highly
correlated stock pairs, but we find that it is not relevant to the
group correlation and thus has to be filtered out for the clear
identification of the stock groups, which is discussed in Sec. II B.

Since the quality of the correlation matrix can depend on the period
of empirical data, or more generally on $Q = L/N$, our decomposition
of the correlation matrix can also depend on $Q$. Here we simply check
how the determination of $N_g$ and the resulting matrix element
distribution of the decomposed matrices change depending on $Q$ (see
Fig. 3). For $Q = 19.4$ (1993-2003) and $Q = 7.0$ (1999-2003), $C^g$
and $C^r$ are separated at $N_g = 10$, which is not very different
from $N_g = 9$ of the larger dataset we use throughout this paper; in
addition, the distribution of the matrix elements shows a similar
degree of heavy tail in $C^g_{ij}$. However, as $Q$ decreases further,
the bulk eigenvalue spectrum becomes wider so that more eigenvalues
relevant to the group correlation can be buried in the bulk spectrum,
which leads to a smaller $N_g$, which turns out to be 7 for $Q = 1.7$
(2003). Even in this case of $Q = 1.7$, the positive heavy tail is
still found in $C^g_{ij}$, but it is much weaker than for higher $Q$.
These results imply that we need a large enough $Q$ for the stock
group identification.
FIG. 4: The comparison of the eigenvector components $u_i$ of the
largest eigenvalue obtained by the exact diagonalization and the
dominant term $(w_i + c_0 N)/(c_0 N^{3/2})$ in Eq. (8). The dashed
line has slope 1.0. (Inset: the values of the corresponding
eigenvector components.)



Our filtering is based on the following interpretation of the previous
studies: the bulk part of the eigenvalues and their eigenvectors are
expected to show the universal properties of random matrix theory, and
the largest eigenvalue and its eigenvector are considered as a
collective response of the entire market. While the random
characteristics of the bulk eigenvalues have been studied intensively,
only empirical tests have been done for the largest eigenvalue and its
eigenvector so far. Thus, to understand their meaning more accurately,
we calculate the largest eigenvalue and its eigenvector of the
correlation matrix by using perturbation theory.

In stock markets, it has been understood that there exist three kinds
of fluctuations in stock price changes: a marketwide fluctuation,
synchronized fluctuations of stock groups, and a random fluctuation
[9]. For simplicity, we consider a situation in which a system with
only the marketwide fluctuation is perturbed by other fluctuations.
Let us assume that the price changes of all the stocks in the market
experience a synchronized background fluctuation with zero mean and
variance $c_0$ as a marketwide effect. Then, we can write down the
$N \times N$ unperturbed correlation matrix as

$$C^0 = \begin{pmatrix} 1 & c_0 & \cdots & c_0 \\ c_0 & 1 & \cdots & c_0 \\ \vdots & \vdots & \ddots & \vdots \\ c_0 & c_0 & \cdots & 1 \end{pmatrix}, \tag{6}$$

which has the largest eigenvalue $\lambda_0^{(0)} = c_0(N-1) + 1$ and
its eigenvector components
$u_i^{(0)} = \langle (\text{stock})\, i | 0^{(0)} \rangle = 1/\sqrt{N}$.

When a small perturbation is turned on, the total correlation matrix
becomes

$$C = C^0 + \Delta, \tag{7}$$

where $\Delta_{ii} = 0$ and $\Delta_{ij} = \Delta_{ji}$. Applying
perturbation theory up to first order, the largest eigenvalue and the
corresponding eigenvector components are easily calculated as

$$\lambda_0 = c_0(N-1) + 1 + \frac{1}{N}\sum_{j,k}\Delta_{jk}, \qquad u_i = \frac{w_i + c_0 N}{c_0 N^{3/2}} - \frac{1}{c_0 N^{5/2}}\sum_{j,k}\Delta_{jk}, \tag{8}$$

where $w_i = \sum_j \Delta_{ij}$.
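
For readers who want the intermediate step, the first-order results can be recovered as follows. This is a sketch assuming standard Rayleigh-Schrödinger perturbation theory for the nondegenerate top state of $C^0$, with $w_i = \sum_j \Delta_{ij}$; the notation $|0^{(0)}\rangle$ for the unperturbed eigenvector with components $1/\sqrt{N}$ is used only here. All other eigenvalues of $C^0$ equal $1 - c_0$, so every energy denominator is $\lambda_0^{(0)} - (1 - c_0) = c_0 N$.

```latex
\begin{align}
\lambda_0 &\simeq \lambda_0^{(0)} + \langle 0^{(0)}|\Delta|0^{(0)}\rangle
           = c_0(N-1) + 1 + \frac{1}{N}\sum_{j,k}\Delta_{jk}, \\
|0\rangle &\simeq |0^{(0)}\rangle
           + \frac{\bigl(1 - |0^{(0)}\rangle\langle 0^{(0)}|\bigr)\,\Delta\,|0^{(0)}\rangle}{c_0 N}, \\
u_i &\simeq \frac{1}{\sqrt{N}} + \frac{w_i}{c_0 N^{3/2}}
      - \frac{1}{c_0 N^{5/2}}\sum_{j,k}\Delta_{jk}
     = \frac{w_i + c_0 N}{c_0 N^{3/2}} - \frac{1}{c_0 N^{5/2}}\sum_{j,k}\Delta_{jk}.
\end{align}
```

The last term is the same for every stock, so the stock-to-stock variation of the eigenvector is carried entirely by $w_i$.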

We check the validity of Eqs. (8) by comparing them with the
eigenvector of the largest eigenvalue obtained from the numerical
diagonalization of the empirical correlation matrix. For the
comparison, we make the distribution of $C_{ij}$ in Eq. (7) close to
the empirical $C_{ij}$ distribution by assuming that $\Delta_{ij}$
follows a bell-shaped distribution with zero mean and setting $c_0$ to
the mean value of the empirical $C_{ij}$. Because this assumption not
only reproduces the distribution of the empirical $C_{ij}$ but also
allows us to neglect the $\frac{1}{N}\sum_{i,j}\Delta_{ij}$ term in
Eqs. (8), we can directly compare the perturbation theory with the
numerical result. Figure 4 displays the eigenvector components of the
largest eigenvalue obtained from the empirical correlation matrix and
the dominant terms of Eqs. (8), which show remarkable agreement with
each other.
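
A comparison of this kind can be reproduced on synthetic data. The sketch below builds $C = C^0 + \Delta$ with a zero-mean Gaussian $\Delta$ and compares the exact top eigenvector with the dominant first-order term $(w_i + c_0 N)/(c_0 N^{3/2})$. The values `c0 = 0.2` and the perturbation width 0.05 are assumptions chosen for illustration, not values fitted to the empirical matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
N, c0 = 135, 0.2                      # c0 is an assumed mean correlation

# Symmetric zero-mean perturbation with zero diagonal, as required below Eq. (7).
upper = np.triu(0.05 * rng.standard_normal((N, N)), k=1)
delta = upper + upper.T
C = (1 - c0) * np.eye(N) + c0 * np.ones((N, N)) + delta

# Exact eigenvector of the largest eigenvalue (eigh sorts ascending).
vals, vecs = np.linalg.eigh(C)
u_exact = vecs[:, -1]
u_exact *= np.sign(u_exact.sum())     # fix the arbitrary overall sign

# Dominant first-order term, with w_i the row sums of delta.
w = delta.sum(axis=1)
u_first_order = (w + c0 * N) / (c0 * N**1.5)
```

Plotting `u_exact` against `u_first_order` should give points close to a line of slope 1 whenever the perturbation is small compared with $c_0 N$.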

Equation (8) indicates that the eigenvector of the largest eigenvalue
receives contributions not only from the global fluctuation but also
from the unknown perturbations in $\Delta$, including random noise.
Thus, by filtering out the $C^m$ term, we can decrease the effect of
unnecessary perturbations in constructing the group correlation
matrix. Indeed, as seen in Fig. 2(c), because the heavy tail part of
$C^g$, that is, the highly correlated elements, is buried in $C^m$,
the clustering of stocks would be seriously disturbed unless $C^m$ is
filtered out.

In addition, Eqs. (8) also enable us to give the eigenvector a more
detailed interpretation than simply the marketwide effect. Because the
$i$th eigenvector component is mostly determined by $w_i$, the sum of
the correlation over all the other stocks, it can be regarded as the
influencing power of the company in the entire stock market. In real
data, the top four stocks with the highest $w_i$ are found to be
General Electric (GE), American Express (AXP), Merrill Lynch (MER),
and Emerson Electric (EMR), mostly conglomerates or huge financial
companies, which convinces us that $w_i$ indeed represents the
influencing power of stock $i$. However, these highly influential
companies prevent clear clustering of stocks because of their
non-negligible correlations with all the stocks in the market. This is
easily understood by considering an analogous situation in a network
where a big hub, a node with a large number of links, can make
connections between groups of nodes that cause difficulties in
distinguishing the groups [16]. Therefore it is very important to
filter out $C^m$ in order to identify the groups of stocks
efficiently.



III. IDENTIFICATION OF STOCK GROUPS 

In the group model for stock price correlation proposed by Noh [13],
the correlation matrix $C$ takes the form $C = C^g + C^r$, where $C^g$
and $C^r$ are the correlation matrix of stock groups and the random
correlation matrix, respectively. The model assumes the ideal
situation with $C^g_{ij} = \delta_{\alpha_i, \alpha_j}$, where
$\alpha_i$ indicates the group to which stock $i$ belongs. Thus $C^g$
is the block diagonal matrix,

$$C^g = \begin{pmatrix} \mathbf{1}_1 & & 0 \\ & \ddots & \\ 0 & & \mathbf{1}_n \end{pmatrix}, \tag{9}$$

where $\mathbf{1}_i$ is the $N_i \times N_i$ matrix ($N_i$ is the
number of stocks in the $i$th group) of which all elements are 1.

FIG. 5: (Color online) The visualization of the group correlation
matrix with the optimized stock sequence $\{l_i\}$.

Here we use this group model to find the groups of stocks. If the
correlation matrix in the real market can be represented by a block
diagonal matrix as in the model, it would be very easy to identify the
groups of stocks. However, there exist a great many possible
representations of the matrix depending on the indexing of rows and
columns, even if we have a matrix equivalent to the block diagonal
matrix. For instance, if we exchange the indices of the matrix (e.g.,
$\{i, j, k\} \to \{k, i, j\}$), the matrix may not be block diagonal
anymore. Therefore the problem of identifying the groups in the stock
correlation matrix requires one to find the optimized sequence of
stocks that transforms the matrix into a well-organized block diagonal
matrix.
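
The dependence on indexing can be seen in a small example. The sketch below builds an ideal group matrix of the form of Eq. (9) for two groups, relabels the stocks so that the block structure is hidden, and then recovers it by sorting on the group labels. The labels and the permutation are arbitrary choices for illustration; in the real problem the labels are unknown, which is why an optimization over sequences is needed.

```python
import numpy as np

# Ideal group matrix with two groups of sizes 3 and 2, as in Eq. (9).
labels = np.array([0, 0, 0, 1, 1])
C_block = (labels[:, None] == labels[None, :]).astype(float)

# Relabel the stocks: the same correlations no longer look block diagonal.
perm = np.array([0, 3, 1, 4, 2])
C_shuffled = C_block[np.ix_(perm, perm)]

# Sorting the shuffled indices by group label restores the block form.
restore = np.argsort(labels[perm], kind="stable")
C_restored = C_shuffled[np.ix_(restore, restore)]
```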

To optimize the sequence of stocks for clear block diagonalization, we
consider the correlation between two stocks as an attractive force
between them. For the ideal group correlation matrix in the group
model, the block diagonal form is evidently the most stable form if
the attractive force between stocks is proportional to their
correlation within the group. To deal with the real correlation
matrix, we define the total energy for a stock sequence as

$$E_{\mathrm{tot}} = \sum_{i<j} C^g_{ij}\, |l_i - l_j|\, \Theta(C^g_{ij} - c_c), \tag{10}$$

where $l_i$ is the location of stock $i$ in the new index sequence,
$\Theta$ is the step function, and the cutoff $c_c = 0.1$ is
introduced to get rid of the random noise part which still remains in
$C^g$ in spite of the filtering [20].

We obtain the optimized sequence of stocks that minimizes the total
energy defined in Eq. (10) by using the simulated annealing technique
[21] in Monte Carlo simulation. The following description of our
problem is very similar to the well-known traveling salesman problem,
that is, finding an optimized sequence of visiting cities which
minimizes the total traveling distance:

1. Configuration. The stocks are numbered $i = 0, \ldots, N - 1$. A
configuration, a sequence of stocks $\{l_i\}$, is a permutation of the
numbers $0, \ldots, N - 1$.

2. Rearrangements. A randomly chosen stock in the sequence is removed
and inserted at a random position of the sequence.

3. Objective function. We use $E_{\mathrm{tot}}$ in Eq. (10) as the
objective function to be minimized after rearrangements.
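
The three ingredients above can be put together in a compact sketch using only the Python standard library. The geometric cooling schedule, the number of steps, the temperature range, and the toy nine-stock matrix are assumptions made for this example; the annealing parameters actually used in this work are not specified here.

```python
import math
import random

def total_energy(C, seq, cutoff=0.1):
    """Eq. (10): sum over pairs of C_ij * |l_i - l_j| for C_ij above the cutoff."""
    pos = {stock: loc for loc, stock in enumerate(seq)}
    n = len(seq)
    return sum(C[i][j] * abs(pos[i] - pos[j])
               for i in range(n) for j in range(i + 1, n)
               if C[i][j] > cutoff)

def anneal(C, steps=20000, t_start=1.0, t_end=1e-3, seed=0):
    rng = random.Random(seed)
    seq = list(range(len(C)))
    rng.shuffle(seq)
    energy = total_energy(C, seq)
    best_seq, best_energy = seq[:], energy
    for step in range(steps):
        temp = t_start * (t_end / t_start) ** (step / steps)  # geometric cooling
        # Rearrangement: remove one stock and reinsert it at a random position.
        trial = seq[:]
        stock = trial.pop(rng.randrange(len(trial)))
        trial.insert(rng.randrange(len(trial) + 1), stock)
        trial_energy = total_energy(C, trial)
        diff = trial_energy - energy
        # Metropolis rule: always accept improvements, sometimes accept uphill moves.
        if diff <= 0 or rng.random() < math.exp(-diff / temp):
            seq, energy = trial, trial_energy
            if energy < best_energy:
                best_seq, best_energy = seq[:], energy
    return best_seq, best_energy

# Toy group matrix: three interleaved groups of three stocks (hypothetical data).
groups = [0, 1, 2, 0, 1, 2, 0, 1, 2]
C = [[1.0 if gi == gj else 0.0 for gj in groups] for gi in groups]
order, energy = anneal(C)
```

Recomputing the full energy at every step costs O(N^2); for larger N one would update only the terms involving the stocks whose positions changed.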
TABLE I: The full list of the optimized sequence of stocks. The footnotes correspond to the identified stock groups represented
by the same footnotes in Fig. 5.



l_i  Ticker  Sector                    l_i  Ticker  Sector                    l_i  Ticker  Sector
  0  XNR     Services                   45  G       Consumer noncyclical^5     90  AMR     Transportation^7
  1  WMB     Utilities                  46  AVP     Consumer noncyclical^5     91  F       Consumer cyclical^7
  2  VLO     Energy                     47  MCD     Services                   92  GM      Consumer cyclical^7
  3  NBL     Energy^1                   48  IFF     Basic materials            93  HPC     Basic materials^8
  4  APA     Energy                     49  WMT     Services                   94  DD      Basic materials
  5  KMC     Energy                     50  FNM     Financial                  95  CAT     Capital goods^8
  6  HAL     Energy                     51  EC      Consumer cyclical          96  DOW     Basic materials
  7  SLB     Energy                     52  KR      Services                   97  WY      Basic materials^8
  8  BP      Energy                     53  HET     Services                   98  IP      Basic materials^8
  9  COP     Energy                     54  TXI     Capital goods              99  GP      Basic materials
 10  CVX     Energy^1                   55  FO      Conglomerates             100  BCC     Basic materials^8
 11  OXY     Energy                     56  SKY     Capital goods             101  AA      Basic materials^8
 12  RD      Energy^1                   57  FLE     Capital goods             102  PD      Basic materials^8
 13  MRO     Energy                     58  RSH     Services                  103  LPX     Basic materials
 14  XOM     Energy                     59  EK      Consumer cyclical         104  N       Basic materials^8
 15  PGL     Utilities^2                60  EMR     Conglomerates             105  DE      Capital goods
 16  CNP     Utilities^2                61  TOY     Services                  106  PBI     Technology
 17  ETR     Utilities^2                62  TEN     Consumer cyclical         107  BDK     Consumer cyclical
 18  DTE     Utilities^2                63  ROK     Technology                108  UNP     Transportation^9
 19  EXC     Utilities^2                64  HON     Capital goods             109  NSC     Transportation^9
 20  AEP     Utilities^2                65  AXP     Financial                 110  CSX     Transportation^9
 21  PEG     Utilities^2                66  GRA     Basic materials           111  BNI     Transportation^9
 22  SO      Utilities^2                67  VVI     Services                  112  CNF     Transportation^9
 23  ED      Utilities^2                68  CSC     Technology^6              113  MAT     Consumer cyclical
 24  PCG     Utilities^2                69  DBD     Technology^6              114  C       Financial
 25  EIX     Utilities^2                70  HRS     Technology^6              115  VIA     Services
 26  LMT     Capital goods^3            71  STK     Technology^6              116  MMM     Conglomerates
 27  NOC     Capital goods^3            72  ZL      Technology^6              117  DIS     Services
 28  RTN     Conglomerates^3            73  TEK     Technology^6              118  BC      Consumer cyclical
 29  GD      Capital goods^3            74  AVT     Technology^6              119  CBE     Technology
 30  BA      Capital goods^3            75  GLW     Technology^6              120  THC     Healthcare^10
 31  BOL     Healthcare^4               76  NSM     Technology^6              121  HUM     Financial^10
 32  MDT     Healthcare^4               77  TXN     Technology^6              122  AET     Financial^10
 33  BAX     Healthcare^4               78  MOT     Technology^6              123  CI      Financial
 34  WYE     Healthcare^4               79  HPQ     Technology^6              124  JCP     Services
 35  BMY     Healthcare^4               80  NT      Technology^6              125  MEE     Energy
 36  LLY     Healthcare^4               81  IBM     Technology^6              126  GE      Conglomerates
 37  MRK     Healthcare^4               82  UIS     Technology^6              127  UTX     Conglomerates
 38  PFE     Healthcare^4               83  XRX     Technology^6              128  R       Services
 39  JNJ     Healthcare^4               84  T       Services                  129  NVO     Healthcare
 40  PEP     Consumer noncyclical^5     85  HIT     Capital goods             130  GT      Consumer cyclical
 41  KO      Consumer noncyclical^5     86  MER     Financial                 131  S       Services
 42  PG      Consumer noncyclical^5     87  FDX     Transportation^7          132  NAV     Consumer cyclical
 43  MO      Consumer noncyclical^5     88  LUV     Transportation^7          133  CEN     Technology
 44  CL      Consumer noncyclical^5     89  DAL     Transportation^7          134  FL      Services


Figure 5 visualizes the correlation matrix elements $C^g_{ij}$ with
the most optimized sequence $\{l_i\}$, and Table I lists the optimized
sequence of stocks. The multiple independent blocks of highly
correlated elements in the matrix are clearly visible without any a
priori knowledge of stocks, i.e., the stocks in different blocks are
believed to belong to different groups. We succeed in identifying
about 70% of the entire 135 stocks from the blocks, which are listed
in Table I, and it turns out that most of the stocks in a block are
represented by a single industry sector or a detailed industrial
classification such as aerospace and defense, airline transport,
railroad, and insurance (see Fig. 5). There still remain a small
number of ungrouped stocks, which arises from the fact that the
correlations between them are too weak to be distinguished from the
random noise that still exists in the group correlation matrix.

As an alternative method, we also perform a network-based approach to
find the groups of stocks. In principle, the correlation matrix can be
treated as an adjacency matrix of the weighted network of stocks, in
which the weights indicate how closely correlated the stocks are in
their price changes [23]. However, for simplicity and a clear
definition of groups in the network, we consider the binary network of
stocks, which permits only two possible states of a stock pair,
connected or disconnected.

FIG. 6: The dependence of the number of isolated clusters, $m$, in the
stock network on the threshold $p$ in constructing the network from
(a) the group correlation matrix and (b) the full correlation matrix.

To construct the binary network of stocks, we use the percolation
approach because of its usefulness in finding groups. The method is
very simple: for each pair of stocks, we connect them if the group
correlation coefficient $C^g_{ij}$ is larger than a preassigned
threshold value $p$. If the heavy tail in the distribution of
$C^g_{ij}$ in Fig. 2 mostly comes from the correlation between the
stocks in the same group, an appropriate choice of $p = p_c$ will give
several meaningful isolated clusters in the network, which are
expected to be identified as different stock groups.

We determine $p_c$ by observing the change of the network structure as
$p$ decreases. Figure 6(a) displays the number of isolated clusters in
the network as a function of the threshold $p$. As we decrease $p$,
the number of isolated clusters in the network increases slowly and
stays near the maximum value down to $p = 0.1$, and then it abruptly
decreases to 1, which indicates that there exists only one isolated
cluster. Therefore we choose $p_c = 0.1$ to construct the most
clustered but stable stock network [24].
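
The threshold scan can be sketched with a union-find structure from plain Python. The function name, the toy seven-stock matrix, and in particular the convention that a cluster must contain at least two connected stocks (so that unlinked stocks are not counted) are assumptions made for this example.

```python
def cluster_count(C, p):
    """Number of isolated clusters (two or more linked stocks) at threshold p."""
    n = len(C)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    # Link every pair whose correlation exceeds the threshold.
    for i in range(n):
        for j in range(i + 1, n):
            if C[i][j] > p:
                parent[find(i)] = find(j)

    sizes = {}
    for i in range(n):
        root = find(i)
        sizes[root] = sizes.get(root, 0) + 1
    return sum(1 for s in sizes.values() if s > 1)

# Toy matrix: two groups of three stocks plus one weakly correlated stock.
groups = [0, 0, 0, 1, 1, 1, None]
n = len(groups)
C = [[1.0 if i == j
      else 0.5 if groups[i] is not None and groups[i] == groups[j]
      else 0.02 if None in (groups[i], groups[j])
      else 0.05
      for j in range(n)] for i in range(n)]
counts = {p: cluster_count(C, p) for p in (0.6, 0.1, 0.01)}
```

In this toy case the count goes from zero clusters at a high threshold, to two clusters at an intermediate one, to a single merged cluster at a low threshold, which mirrors the qualitative behavior used above to pick $p_c$.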

We find that the constructed network consists of separable groups of
stocks which correspond to the industrial sectors of stocks (see
Fig. 7). At $p_c = 0.1$, the network has 92 nodes and 357 links. The
identification of stock groups is very clear because the clusters in
the network, which we consider to be equivalent to stock groups, are
fully connected networks or very dense networks in which most of the
nodes in the cluster are directly connected.



However, although most of the stock groups are represented by a single
industrial sector, it is found that stocks which belong to two
different industrial sectors can coexist in a cluster. For instance,
the stocks in the healthcare sector and the noncyclical consumer
sector cannot be separated in this network. Indeed, in Fig. 5 one can
observe a non-negligible correlation between the healthcare and the
noncyclical consumer sectors, which indicates the presence of an
intergroup correlation. In the real market, the presence of such an
intersector correlation can be expected, and our clustering results
shown in Figs. 5 and 7 present both the intergroup and intragroup
correlations that exist in the real stock market.

The group identification based on the eigenvector analysis of the
stock price correlation matrix has been studied by several research
groups. In spite of their pioneering achievements in revealing the
localization properties of eigenvectors, the classification of stocks
into groups was not so clear, and it covered only about 10% of their
stocks because they used only the few highest contributions of
eigenvector components due to the ambiguity explained in Sec. I. In
this work, we not only introduce a more refined and systematic method
to identify the stock groups, but also successfully cluster about 70%
of stocks into groups, although a direct comparison of the success
ratio might be inappropriate because our data set is different from
theirs.

On the other hand, Onnela et al. introduced the percolation approach
to construct the stock network, in which the links are added between
stocks one by one in descending order from the highest element of the
full correlation matrix. In their work, though highly correlated
groups of stocks were found, the threshold value of the correlation
that settles the network structure could hardly be determined; the
number of isolated clusters as a function of the threshold did not
show a clear cut. We believe that this is attributed to the fact that
they used the full correlation matrix carrying marketwide and random
fluctuations. We would also fail to determine the critical threshold
value of correlation if we used the full correlation matrix instead of
the filtered one [see Fig. 6(b)]. This indicates that the filtering is
crucial for the stock group identification.

Finally, we note that Marsili et al. introduced a different method to
filter noise from the time series of stock price log-returns for stock
group identification. In their work, it was assumed that the
normalized log-return could be expressed by a linear combination of
the noise at the individual stock level and the noise at the level of
the groups, which was fitted to the real data to determine the weights
of the two noises and the constituents of the groups. However, we
found that the effect of the inhomogeneous marketwide fluctuation is
so significant that the marketwide effect needs to be considered
seriously to describe the correlation between stocks correctly.
Indeed, it is found that filtering out the corresponding $C^m$
improves the clustering result.
FIG. 7: The stock network with the threshold $p_c = 0.1$. The thickness of links indicates the strength of the correlation in the
group correlation matrix.



IV. CONCLUSION 

In conclusion, we successfully identify multiple groups of stocks from
the empirical correlation matrix of stock price changes in the New
York Stock Exchange. We propose refined methods to find stock groups
which dramatically reduce ambiguities as compared to identifying stock
groups from the localization in a single eigenvector of the
correlation matrix [10, 11]. From the analysis of the characteristics
of eigenvectors, we construct the group correlation matrix of the
stock groups, excluding the marketwide effect and random noise. By
optimizing the representation of the group correlation matrix, we find
that the group correlation matrix is represented by a block diagonal
matrix where the stocks in each block belong to the same group. This
coincides with the theoretical model of Noh [13]. Equally good stock
group identification is also achieved by the percolation approach on
the group correlation matrix to construct the network of stocks.